A Cross Training Corrective Approach for Web Pages Classification

Authors

  • Abdelbadie Belmouhcine
  • Mohammed Benkhalifa
Abstract

Textual document classification is a challenging area of data mining, and web page classification is one instance of it. However, the text in web pages is not homogeneous, since a single page can discuss related but different subjects. As a result, a textual classifier performs worse on web pages than on plain textual documents. We therefore need a method that improves, or more precisely corrects, the results of such classifiers. One category of techniques addresses this problem by using the information hidden in the test set to correct the classes assigned by a textual classifier. In this paper, we propose a method in this category: a Cross Training based Corrective approach (CTC) for web page classification, which learns information from the test set in order to fix the classes initially assigned to that test set by a text classifier. This adjustment leads to a significant improvement in classification results. We tested our approach with three traditional classification algorithms, Support Vector Machine (SVM), Naïve Bayes (NB) and K Nearest Neighbors (KNN), on four subsets of the Open Directory Project (ODP). Results show that our collective and corrective approach, when applied after SVM, NB or KNN, enhances their classification results by up to 12.39%.
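The abstract does not spell out the CTC algorithm, so the following is only a rough sketch of the general idea it describes: a base classifier labels the test set, and a second pass learns from the test set itself to revise those labels. Everything concrete here is an assumption made for illustration, not the authors' method: the nearest-centroid stand-in classifier (instead of SVM, NB or KNN), the k-fold relabeling scheme, the function names `fit_centroids`, `predict` and `cross_training_correct`, and the toy one-dimensional data.

```python
from collections import Counter

def fit_centroids(X, y):
    """Stand-in text classifier (assumption): one mean feature vector per class."""
    counts = Counter(y)
    sums = {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(model, X):
    """Assign each vector to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(sorted(model), key=lambda c: sq_dist(model[c], vec)) for vec in X]

def cross_training_correct(X_test, initial_labels, n_folds=2):
    """Hypothetical corrective pass: relabel each fold of the test set with a
    model trained on the other folds and their initially assigned labels."""
    corrected = list(initial_labels)
    folds = [list(range(k, len(X_test), n_folds)) for k in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X_test)) if i not in held]
        model = fit_centroids([X_test[i] for i in train_idx],
                              [initial_labels[i] for i in train_idx])
        new_labels = predict(model, [X_test[i] for i in held_out])
        for i, label in zip(held_out, new_labels):
            corrected[i] = label
    return corrected

# Toy data (assumption): the test-set clusters are shifted relative to training,
# so the base classifier mislabels the page at 2.2.
X_train, y_train = [[0.0], [4.0]], ["sports", "arts"]
X_test = [[0.6], [1.0], [5.0], [5.4], [1.4], [2.2], [5.8], [6.2]]

initial = predict(fit_centroids(X_train, y_train), X_test)
corrected = cross_training_correct(X_test, initial)
print(initial)
print(corrected)
```

In this toy run the page at 2.2 is first labeled "arts" because it lies closer to the training centroid at 4.0, and the corrective pass moves it to "sports" because the test set's own clusters place it there. A real experiment would replace the stand-in classifier and the one-dimensional features with the text classifiers and page representations the paper evaluates.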


Similar resources

Iterative cross-training: An algorithm for learning from unlabeled Web pages

The paper presents a learning method, called Iterative Cross-Training (ICT), for classifying Web pages in two classification problems: (1) classification of Thai/non-Thai Web pages, and (2) classification of course/non-course home pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that are able to effectively use unlabeled examples to iterat...


A Comparative Study of Web-pages Classification Methods using Fuzzy Operators Applied to Arabic Web-pages

In this study, a fuzzy similarity approach for Arabic web pages classification is presented. The approach uses a fuzzy term-category relation by manipulating membership degree for the training data and the degree value for a test web page. Six measures are used and compared in this study. These measures include: Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and Bounded Difference ap...


Efficient Prediction of Cross-Site Scripting Web Pages using Extreme Learning Machine

Malicious code is a way of attempting to acquire sensitive information by sending malicious code to a trustworthy entity in an electronic communication. JavaScript is the most frequently used command language in the web page environment. If hackers misuse JavaScript code, they may be able to steal authentication and confidential information about an organization and its users. ...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...


Discovering Test Set Regularities in Relational Domains

Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the trainin...



Journal:
  • IJCSA

Volume 12

Published 2015